BSTA 6100 Fall 2025
2025-11-25
A study was carried out to examine the relationship between body weight (measured in pounds) and covariates age (years) and height (inches) among children aged 6-12.
'data.frame': 12 obs. of 4 variables:
 $ id : int 1 2 3 4 5 6 7 8 9 10 ...
 $ wt : int 64 71 53 67 55 58 77 57 56 51 ...
 $ ht : int 57 59 49 62 51 50 55 48 42 42 ...
 $ age: int 8 10 6 11 8 7 10 9 10 6 ...
First, we’ll use simple linear regression to estimate an association between height and weight.
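The notes don't echo the fitting code, so here is a minimal sketch. The object name `mod1` is an assumption; since the raw 12 rows of `d` aren't fully reproduced above, the block builds a stand-in data frame with the same structure.

```r
# Stand-in for the course data frame `d` (the real values aren't fully shown):
set.seed(6100)
d <- data.frame(ht = round(runif(12, 42, 62)),
                age = sample(6:12, 12, replace = TRUE))
d$wt <- round(6 + 1.1 * d$ht + rnorm(12, sd = 5))

mod1 <- lm(wt ~ ht, data = d)  # regress weight (lbs) on height (in)
summary(mod1)
```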
Call:
lm(formula = wt ~ ht, data = d)
Residuals:
Min 1Q Median 3Q Max
-5.8736 -3.8973 -0.4402 2.2624 11.8375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.1898 12.8487 0.482 0.64035
ht 1.0722 0.2417 4.436 0.00126 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.471 on 10 degrees of freedom
Multiple R-squared: 0.663, Adjusted R-squared: 0.6293
F-statistic: 19.67 on 1 and 10 DF, p-value: 0.001263
Next, we’ll use simple linear regression to estimate an association between age and weight.
Call:
lm(formula = wt ~ age, data = d)
Residuals:
Min 1Q Median 3Q Max
-11.000 -3.911 1.143 4.071 10.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.5714 8.6137 3.549 0.00528 **
age 3.6429 0.9551 3.814 0.00341 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.015 on 10 degrees of freedom
Multiple R-squared: 0.5926, Adjusted R-squared: 0.5519
F-statistic: 14.55 on 1 and 10 DF, p-value: 0.003407
Estimate a 90% confidence interval for the association between age and weight.
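One way to answer is with `confint()`, sketched below. The model name `mod2` is an assumption, and a stand-in data frame substitutes for the real `d`.

```r
# Stand-in for the notes' data frame `d`:
set.seed(6100)
d <- data.frame(ht = round(runif(12, 42, 62)),
                age = sample(6:12, 12, replace = TRUE))
d$wt <- round(6 + 1.1 * d$ht + rnorm(12, sd = 5))

mod2 <- lm(wt ~ age, data = d)
confint(mod2, parm = "age", level = 0.90)  # 90% CI for the age coefficient
```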
Estimate a 95% confidence interval for the weight of a child who is 53 inches tall.
fit lwr upr
1 63.01806 59.49644 66.53968
Estimate a 95% prediction interval for the weight of a child who is 53 inches tall.
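A sketch, assuming the height model was saved as `mod1` (stand-in data in place of the real `d`). The prediction interval is wider than the confidence interval above because it also accounts for individual-level error.

```r
# Stand-in for the notes' data frame `d`:
set.seed(6100)
d <- data.frame(ht = round(runif(12, 42, 62)),
                age = sample(6:12, 12, replace = TRUE))
d$wt <- round(6 + 1.1 * d$ht + rnorm(12, sd = 5))

mod1 <- lm(wt ~ ht, data = d)
predict(mod1, newdata = data.frame(ht = 53),
        interval = "prediction", level = 0.95)
```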
To fit a multiple regression model in R, add covariates to the formula object:
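For example (the name `mod3` matches the object used with `predict()` later in the notes; a stand-in data frame substitutes for the real `d`):

```r
# Stand-in for the notes' data frame `d`:
set.seed(6100)
d <- data.frame(ht = round(runif(12, 42, 62)),
                age = sample(6:12, 12, replace = TRUE))
d$wt <- round(6 + 1.1 * d$ht + rnorm(12, sd = 5))

mod3 <- lm(wt ~ ht + age, data = d)  # add covariates with `+` in the formula
summary(mod3)
```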
Call:
lm(formula = wt ~ ht + age, data = d)
Residuals:
Min 1Q Median 3Q Max
-6.8708 -1.7004 0.3454 1.4642 10.2336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5530 10.9448 0.599 0.5641
ht 0.7220 0.2608 2.768 0.0218 *
age 2.0501 0.9372 2.187 0.0565 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared: 0.78, Adjusted R-squared: 0.7311
F-statistic: 15.95 on 2 and 9 DF, p-value: 0.001099
Interpret each of the parameters.
Interpret the F test.
Given a fitted lm object, we can get estimated means / predicted outcomes using predict():
predict(mod3,
newdata = data.frame(
"ht" = mean(d$ht),
"age" = mean(d$age)
),
        se.fit = TRUE,
        interval = "confidence", # or "prediction"
        level = 0.95)
$fit
       fit      lwr      upr
1 62.75 59.70699 65.79301
$se.fit
[1] 1.345181
$df
[1] 9
$residual.scale
[1] 4.659845
The newdata argument is an optional data frame containing values of the covariates at which you want to predict the outcome. So the above code returns the estimated mean weight of a child with average height and age from the data: \hat{Y} = 6.553 + 0.7220(52.75) + 2.0501(8.833) = 62.75 \text{ lbs}.
Given a fitted lm object, we can get estimated confidence intervals for the parameters using the confint() function.
We can get CIs for all parameters:
2.5 % 97.5 %
(Intercept) -18.20587071 31.311967
ht 0.13205592 1.312020
age -0.07002526 4.170278
Or just some, using the parm argument to confint().
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
sex year
1 male 2007
2 female 2007
3 female 2007
4 female 2007
5 male 2007
6 female 2007
Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm,
data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1083.08 -282.65 -17.18 242.95 1293.24
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5836.299 312.604 -18.670 <2e-16 ***
bill_length_mm 4.959 5.214 0.951 0.342
flipper_length_mm 48.890 2.034 24.034 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393.4 on 330 degrees of freedom
Multiple R-squared: 0.7627, Adjusted R-squared: 0.7613
F-statistic: 530.4 on 2 and 330 DF, p-value: < 2.2e-16
We can add or subtract covariates to an existing lm() object using update():
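For example, a sketch assuming the first penguins model was saved as `reg1` (`reg2` is the name the notes use later), with `penguins` from the palmerpenguins package, whose column names match the output above:

```r
library(palmerpenguins)  # install.packages("palmerpenguins") if needed

reg1 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm, data = penguins)
reg2 <- update(reg1, . ~ . + species)  # refit with species added to the formula
summary(reg2)
```

The `. ~ .` notation means "keep the outcome and covariates as they are"; terms after `+` (or `-`) are then added (or dropped).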
Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-811.52 -227.81 -33.74 216.73 1053.26
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3864.073 533.592 -7.242 3.18e-12 ***
bill_length_mm 60.117 7.207 8.341 2.06e-15 ***
flipper_length_mm 27.544 3.209 8.583 3.77e-16 ***
speciesChinstrap -732.417 82.063 -8.925 < 2e-16 ***
speciesGentoo 113.254 89.206 1.270 0.205
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 339.6 on 328 degrees of freedom
Multiple R-squared: 0.8243, Adjusted R-squared: 0.8222
F-statistic: 384.7 on 4 and 328 DF, p-value: < 2.2e-16
Interpret the species parameters.
Above, we fit a model reg2: Y_i = \beta_0 + \beta_1 \text{(bill length)}_i + \beta_2 \text{(flipper length)}_i + \beta_3 \text{chinstrap}_i + \beta_4 \text{gentoo}_i + \epsilon_i.
Consider testing \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0, the omnibus test of all parameters in the model. We can do this just from the R output for reg2:
Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-811.52 -227.81 -33.74 216.73 1053.26
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3864.073 533.592 -7.242 3.18e-12 ***
bill_length_mm 60.117 7.207 8.341 2.06e-15 ***
flipper_length_mm 27.544 3.209 8.583 3.77e-16 ***
speciesChinstrap -732.417 82.063 -8.925 < 2e-16 ***
speciesGentoo 113.254 89.206 1.270 0.205
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 339.6 on 328 degrees of freedom
Multiple R-squared: 0.8243, Adjusted R-squared: 0.8222
F-statistic: 384.7 on 4 and 328 DF, p-value: < 2.2e-16
Now consider testing the null that \beta_3 = \beta_4 = 0, i.e., that mean penguin body mass does not vary by species, adjusted for bill length & flipper length. This is a covariate subset test. We can do this by fitting a reduced model and using the anova() function.
Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm,
data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1083.08 -282.65 -17.18 242.95 1293.24
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5836.299 312.604 -18.670 <2e-16 ***
bill_length_mm 4.959 5.214 0.951 0.342
flipper_length_mm 48.890 2.034 24.034 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 393.4 on 330 degrees of freedom
Multiple R-squared: 0.7627, Adjusted R-squared: 0.7613
F-statistic: 530.4 on 2 and 330 DF, p-value: < 2.2e-16
Analysis of Variance Table
Model 1: body_mass_g ~ bill_length_mm + flipper_length_mm
Model 2: body_mass_g ~ bill_length_mm + flipper_length_mm + species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 330 51071963
2 328 37820194 2 13251769 57.464 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
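The table above can be reproduced with something like the following (the names `reg1`/`reg2` for the reduced and full models are assumptions; `penguins` from the palmerpenguins package):

```r
library(palmerpenguins)

reg1 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm,
           data = penguins)              # reduced model
reg2 <- update(reg1, . ~ . + species)    # full model
anova(reg1, reg2)  # F test of H0: both species coefficients are zero
```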
We can use the glht() function from the multcomp package (note: install multcomp using install.packages("multcomp")) to test general linear hypotheses.
Let’s look at a model for penguin body mass against bill length, flipper length, and island.
reg4 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm + island, data = penguins)
summary(reg4)
Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
island, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1017.04 -264.91 -18.85 217.82 1370.07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4355.027 403.515 -10.793 < 2e-16 ***
bill_length_mm 16.883 5.601 3.014 0.00278 **
flipper_length_mm 39.656 2.535 15.645 < 2e-16 ***
islandDream -333.716 58.736 -5.682 2.95e-08 ***
islandTorgersen -190.960 70.748 -2.699 0.00731 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 376.5 on 328 degrees of freedom
Multiple R-squared: 0.784, Adjusted R-squared: 0.7814
F-statistic: 297.7 on 4 and 328 DF, p-value: < 2.2e-16
Let’s consider testing whether the difference in mean body mass on Dream and Torgersen islands is zero, adjusting for bill length & flipper length.
First, set up the hypothesis test:
H_0: \beta_{\text{Dream}} = \beta_{\text{Torgersen}} \quad \text{vs.} \quad H_1: \beta_{\text{Dream}} \neq \beta_{\text{Torgersen}}
To compare factor levels, use the mcp() function.
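A sketch of the call that produces the output below (refitting `reg4` for self-containment; the character string inside mcp() names the factor levels being contrasted):

```r
library(multcomp)        # install.packages("multcomp") if needed
library(palmerpenguins)

reg4 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm + island,
           data = penguins)

# H0: adjusted mean body mass is equal on Dream and Torgersen islands
summary(glht(reg4, linfct = mcp(island = "Dream - Torgersen = 0")))
```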
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: User-defined Contrasts
Fit: lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
island, data = penguins)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Dream - Torgersen == 0 -142.76 69.72 -2.047 0.0414 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
We could also specify a contrast vector or matrix:
(Intercept) bill_length_mm flipper_length_mm islandDream
-4355.02693 16.88279 39.65617 -333.71627
islandTorgersen
-190.95953
[,1] [,2] [,3] [,4] [,5]
Dream - Torgersen 0 0 0 1 -1
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
island, data = penguins)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Dream - Torgersen == 0 -142.76 69.72 -2.047 0.0414 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
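A sketch of the contrast-matrix version: the single row of K lines up with coef(reg4) in the order printed above, so islandDream gets +1 and islandTorgersen gets -1.

```r
library(multcomp)
library(palmerpenguins)

reg4 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm + island,
           data = penguins)

# Columns follow coef(reg4): (Intercept), bill_length_mm, flipper_length_mm,
# islandDream, islandTorgersen
K <- rbind("Dream - Torgersen" = c(0, 0, 0, 1, -1))
summary(glht(reg4, linfct = K))
```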
With multiple covariates, the linearity assumption becomes even more critical. Not only do we have to correctly model the form of the relationship between Y and X_1, but also X_2, X_3, etc. We can use a plot of residuals vs. fitted values to check this assumption. We could also make partial regression plots to check.
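For instance, with any fitted lm object (here an illustrative model on R's built-in cars data; avPlots() from the car package, noted as an assumption, is one option for partial regression plots):

```r
fit <- lm(dist ~ speed, data = cars)  # illustrative model on a built-in dataset

plot(fit, which = 1)  # residuals vs. fitted values; look for curvature or fanning

# Partial regression (added-variable) plots, e.g. with the car package:
# car::avPlots(fit)
```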
Leverage is a measure of how much the data from the ith observation contribute to the fitted value \hat{Y}_i. High leverage means that X_i strongly determines \hat{Y}_i. Remember \hat{Y} = HY = X\left(X^\top X\right)^{-1} X^\top Y.
The leverage for the ith observation is the (i,i)th element (the ith diagonal) of the hat matrix H, h_{ii} = x_{i}^\top (X^\top X)^{-1} x_{i}, and reflects outlyingness in X space.
The standardized residuals are scaled by an estimate of the errors’ variance; we could also scale by an estimate of the residuals’ variance, which gives the studentized residuals defined below. Remember,
\operatorname{Var}(\hat{\epsilon}) = \sigma^2(I_n - H),
where H = X (X^\top X)^{-1} X^\top is the hat matrix and I_n is the n\times n identity matrix. The diagonal elements of the hat matrix, denoted h_{ii}, are called leverages. The variance of a single residual is \operatorname{Var}(\hat{\epsilon}_i) = \sigma^2 (1 - h_{ii}). The studentized residuals are
\hat{t}_{i} = \frac{\hat{\epsilon}_i}{\hat{\sigma} \sqrt{1-h_{ii}}}.
The studentized residuals have constant variance 1 regardless of the location of x_i when the model is correctly specified. Studentized residuals are useful for examining outliers because they’re leverage-adjusted: they account for the fact that the variance of the residuals changes with X_i.
Studentized residuals are approximately standard normal: “large” values (relative to a standard normal distribution) indicate outlyingness.
The jackknife residuals are similar to studentized residuals, but refit to remove the ith individual:
\hat{r}_{(-i)} = \frac{\hat{\epsilon}_i}{\hat{\sigma}_{(-i)} \sqrt{1 - h_{ii}}},
where \hat{\sigma}_{(-i)}^2 is an estimator of \sigma^2 with the ith individual removed (i.e., that doesn’t use data from individual i).
Jackknife residuals completely remove the influence of the ith individual in determining whether it’s an outlier, because \hat{\epsilon}_i can mask its own outlyingness:
For points with high leverage, |\hat{\epsilon}_i| may be small (remember \operatorname{Var}(\hat{\epsilon}_i) = \sigma^2 (1 - h_{ii}), which shrinks as h_{ii} \to 1)
If |\hat{\epsilon}_i| is large, \hat{\sigma}^2 might over-estimate \sigma^2, so things like the studentized residuals might be too small.
Influence measures the “pull” a single point has on a fitted regression line. Highly influential points draw the fitted line closer to them, possibly biasing results. They typically have both
high leverage
departure from trend in other data points
but one or the other does not necessarily imply influence. An influential point is one with high leverage and outlyingness.
One measure of influence is Cook’s Distance:
D_i = \frac{\left(\hat{\beta}_{(-i)}-\hat{\beta}\right)^\top \left(X^\top X\right)\left(\hat{\beta}_{(-i)} - \hat{\beta}\right)}{(p+1)\hat{\sigma}^2}.
Cook’s distance is a scaled distance between the fitted regression coefficients after removing observation i and the fitted regression coefficients that use observation i. If D_i > 1, generally we’ll say the observation is influential.
We can plot Cook’s distance for all points in the model using the which argument to plot.lm():
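For example (which = 4 selects the Cook's distance plot; cooks.distance() returns the values directly, shown here on an illustrative built-in dataset):

```r
fit <- lm(dist ~ speed, data = cars)  # illustrative model on a built-in dataset

plot(fit, which = 4)        # Cook's distance for each observation
d_cook <- cooks.distance(fit)
which(d_cook > 1)           # flag influential points by the D_i > 1 rule of thumb
```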